I> 



0\ 



o 
"o 



CT 1 - 



Information theory, multivariate dependence, 
and genetic network inference 



q . Dya Nemenman 

f^i ■ Kavli Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106 
fN| \ |nemenman@kitp . ucsb . edu| 

3 



Abstract 



We define the concept of dependence among multiple variables using 
maximum entropy techniques and introduce a graphical notation to de- 
note the dependencies. Direct inference of information theoretic quan- 
tities from data uncovers dependencies even in undersampled regimes 
q ■ when the joint probability distribution cannot be reliably estimated. The 

• i— i \ method is tested on synthetic data. We anticipate it to be useful for infer- 

t~ • ence of genetic circuits and other biological signaling networks. 

cr 

1 Two problems 

— l i 

> 

ly-v ■ One of the most active fields in quantitative biology is the inference of biological interaction 

networks (e. g., protein or genetic regulatory networks) from high throughput data such as 
expression microarrays Qfl 1 . In these problems, one measures (simultaneous or serial) 

\Q • values of expressions of genes under different conditions and treats them as samples from 

a joint probability distribution (PD). The goal is to infer the genetic network based on 
statistical dependencies in this PD. 



This involves a conceptual and a technical problem. First, surprisingly, even now there is 
still no agreement on what the dependence, the interaction, is in a multivariate setting. In- 



^O , stead of a universal definition, standard statistical methods [2, 3] have produced a multitude 



of dependence concepts applicable in restricted contexts, such as normal noise, binary, bi- 
variate, or metric data, etc. Of these, the notion of conditional (in)dependence in the form 
£> . of Bayesian networks (BN) [1] has proved to be very useful in biological applications. 

J^j ' However, it is insufficient to deal with regulatory loops, or to distinguish independent vs. 

cooperative regulation of a gene by others (see below). Further, statistical dependence is a 
symmetric property |4], while graphs of BNs are directed. Thus, to infer interaction net- 
works, we must first carefully define what we mean by multivariate statistical dependence. 

The goal is to partition the overall statistical dependence, that is, the deviation of the joint 
PD (JPD) from the product of its marginals, into contributions from interactions of different 
kinds (among pairs of variables, triplets, etc.), and, better yet, from specific combinations 
of variables within a kind. Many ideas have been suggested l2J|5l|6j], but the most natural 

'The literature on reverse engineering biological networks develops quickly, and we do not try 
to provide a exhaustive bibliography. On the other hand, since this paper focuses on fundamental 
concepts of statistical dependence, a considerable effort has been expended to make the relevant part 
of the bibliography complete. 



approach is to quantify the new knowledge that comes from looking at a complete JPD vs. 
its approximations under various independence assumptions. For example, in contingency 
tables analysis, one studies deviations of the number of observed counts from their expec- 
tations under such assumptions I7tl8l.l9l Uollllll . Such discussion is limited to categorical 
data and, importantly, confounds the definition of dependence with sampling issues. In- 
formation theory [ 12, 13] provides tools to treat continuous and categorical data uniformly 
1141 U5I1 and to generalizes bivariate dependence measures to multivariate cases based on 
distributions rather than counts. However, some suggested information theoretic measures 
lla. H7I HH Il9t l2(A 121. 22, 23, 24, 25, 26] are not necessarily nonnegative, or involve 
averaging logarithms of fractional powers of PDs. Thus they cannot characterize sizes of 
typical sets and have not become universally accepted. Instead, one notices that the max- 
imum entropy (MaxEnt) distributions 1271 12811 consistent with some marginals of the JPD 
introduce no statistical interactions beyond those in the said marginals. Thus the JPD can be 
compared to its MaxEnt approximations to separate dependencies included in the low order 
statistics from those not present in them El 12311 . This approach completely characterizes 
multivariate interactions for binary 1 29, 30] or exponential [31] distributions. In general, it 
led to a definition of connected interactions of a given order, that is, the interactions that 
need, at least, the full set of marginals of this order to be captured [32]. However, there 
has been no successful attempt of defining dependence among variables, that is, localizing 
(connected) interaction to particular covariates. 

The second, technical, problem emerges if we agree on the measures of dependence. The 
usual approach then is to use the experimental data to infer the JPD in question and evaluate 
these measures for it 2 . Unfortunately, the distributions may be severely undersampled. 
To fight this, the field has focused on a simplistic approach 1 33, 34]: (a) assume some 
dependence structure; (b) regularize the JPD consistently with it and learn it from the data; 
(c) evaluate the quality of fit; (d) repeat until the dependence structure with the best fit 
is found. Validity of such analysis is sensitive to the choice of the regularization, and a 
bad choice may lead to misestimation of dependencies. As a rule, inferring a complicated 
object (the JPD) in order to only find its simple property (the dependencies) is not a good 
idea 13511 . and a direct estimation of dependencies without learning the JPD is preferred 3 . 

From Shannon's work 1 12] and its later developments |4, 32] we know that we have to look 
at information theoretic quantities to measure dependence. Many of these are differences 
of (marginal, conditional) entropies of the JPD. While earlier works fit [1911 relied on good 
sampling, we now know that, at least under some conditions, entropy may be estimated 
reliably even when inferences about details of the underlying PD are impossible 4 . Thus the 
direct estimation of dependencies has a chance even for undersampled problems. 

In thi s pap er we deal with both the conceptual and the technical problem. First, expanding 
on lH. 13211 (see also 1 39] for axiomatization), we systematically characterize dependencies 
among variables. Second, we apply direct entropy estimation methods to undersampled 
synthetic data to show that interactions can be uncovered even in that regime. 



2 We deliberately remain vague about the identity of the covariates. These can be the simultaneous 
gene expressions, in which case there is a big leap between inferring statistical dependencies and 
reconstructing the network. These also can be the time lagged expressions, or their whole time 
courses, which makes the reconstruction easier, or something altogether different. 

3 Direct estimation of the quantity of interest without the intermediate inference of the underlying 
PD has been useful in various contexts, in particular for estimating entropies 1 39. 13711 . 

4 The reader is referred to |38] and to menem. com/~ilya/pages/NIPS03 for overviews. 



2 Definitions 

Suppose we have a network of M covariates Xi, i = 1, . . . , M, (called nodes, expressions, 
or genes) that take random values Xi respectively with the joint probability P (x) . The total 
statistical dependence among the variables is given by their multiinformation, that is, the 
Kullback-Leibler divergence between the JPD and the product of the marginals 1 32J. I3SM 5 . 

m := E^I^JjSL . (>o 82n ^L) = E« - SIX], (1) 

where J^'s represent summations for discrete and integration for continuous variables re- 
spectively 6 , and S[X] :— — ^ P(x) log 2 P(x) is the entropy of P(x). 

Following 1 32], we localize these bits of dependence to statistical interactions of different 
orders. If all m-way marginals, P(Xi x ,Xi 2 , . . . , Xi m ), are known, then one finds an approx- 
imation to the JPD that respects the marginals, but makes no additional assumptions about 
the JPD 7 . This is given by the MaxEnt, or minimum multiinformation, problem H. 1271 13211 : 

PM := arg max As[P>] - f] £ K...i m (K...i m ~ P n-ij\, ( 2 ) 

il<---<i m Xi 1 ...x im ' 

where A's are the Lagrange multipliers enforcing the marginal constraints; different A's are 
distinguished by their arguments. The arguments of all functions are listed in their lower 
indices, e. g., Pi := P{x\). No indices means dependence on all variables, while jt on all, 
but Xi. Further, a distribution with lower indices is a marginal, e. g., P12 := J^v v P- 

A solution of any MaxEnt problem with marginal constraints has a form of a product of 
terms depending on the constrained variables B4Qll . In particular, for Eq. (0, 

p(m) = JJ f ^ ^ f > Qj (3) 

ii<---<i m 

Pt ] .i m = p ii-..i». V{h,..;imh ( 4 ) 

where /'s, again distinguished by their arguments, are to be found from the constraints. 
Note the Boltzmann machine or Markov network structure of this MaxEnt distribution 
111 Bill . In general, no analytical solution for /'s exists. However, an algorithm called the 
iterative proportional fitting procedure (IPFP) [42], which iteratively adjusts a trial solution 
to satisfy each of the constraints in turn, converges to the true solution [40]. 

Finding p( m ) and p( m_1 ) defines connected information [3 

I N / P (m) \ 

l(m) W ■= (i°g 2 T^riy ); (5) 

M 

np] - E /(m) t p ]' <® 



5 We remind the readers that the Kullback-Leibler divergence, Dkl[-P||Q] = 
^2 x P(x) log 2 P(x)IQ{x) is a natural information-theoretic measure of dissimilarity between PDs. 
It is nonsymmetric, nonnegative, and it is zero iff P — Q fPJII . 

6 In this, work we do not aim at mathematical rigor of the measure theoretic information theory. 
In particular, we assume that all quantities of interest exist for all distributions considered. 

7 All JPDs constrained by the same marginals are said to form a Frechet class |2]. For metric 
variables and simple constraints, these classes are well studied. We know parametric forms for some 
of them, can check if the constraints are compatible, and if they determine the JPD uniquely. 



which measures the amount of statistical interactions accounted for by m-way, but not by 
m — 1-way marginals. This is similar to connected correlation functions or cumulants. 

In the same spirit, to determine if a particular m-way interaction contributes to /, we may 
check if fixing the corresponding Pi 1 ...i m recovers any dependencies not already contained 
in a reference MaxEnt distribution Q*('i— *«0 constrained by some other marginals. That 
is, we define the interaction multiinformation 

I (~)(ii—i m ) \ 

A (h-i m ) / j ^_ — _^ \ = T (.h...i m ) _ P (h...i m ) (7) 

\ Q*(»l— *m) / 

where Q' ' ) is the interaction MaxEnt distribution satisfying all constraints of Q*( ' ' and 
additionally having Q\ \ ' i — Pi x ...i m . 1^' ' and J*( ' ) are multiinformations of Q( ' ) and 
Q*( ' ' respectively. By positivity of the Kullback-Leibler divergence, A' ' ) > 0. Thus if 
A' ' > > 0, accounting for the marginal Pi 1 ...i m recovers more multiinformation, and we 
say that the corresponding interaction or dependence is present with respect to Q*^'. 

The problem is that A' ' ' depends on the choice of Q* 1 - '\ To test the null hypothesis 
of no dependencies, we must select the reference Q* 1 - ' > that minimizes the interaction 
information. This guarantees that interactions are accepted only if they cannot be reduced 
to some other statistical dependencies in the network. According to Thm.^(see Appendix), 
such reference distribution is given by 

Q* (n - 4m) = &•••/«»« /^° W 

Q^- 4m) = P Vk , Vfe = l...m. (9) 

This Q*( ' ' preserves all marginals of the original JPD except those that involve all m co- 
variates being examined for an interaction. This is similar to the Type III Sum of Squares 
ANOVA for testing significance of predictors. In fact, since -Dkl is equal to \ 2 asymp- 
totically, the similarity is not accidental. Dependence defined by this choice of Q*( ' Ms a 
generalization of the conditional dependence with the rest of the network as a condition. 

The interaction PD, which additionally preserves the joint statistics of X^ , . . . , Xi m , but 
nothing extra, is 
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Using such Q and Q* in Eq. Q defines irreducible m-way interactions (dependencies) 
among particular m variables. We denote these dependencies graphically by edges coming 
from the variables and meeting at an m-edge vertex, see Fig.^ This graphical notation 
generalizes BNs and 1 32], were the only goal was to denote m-way interactions among all 
combinations of co-variates simultaneously. 

3 Examples and properties 

We consider a few examples for M — 3 (larger M is analyzed similarly). First, a regulatory 
cascade, or a Markov chain: X\ —* X 2 — > X 3 , P(xi,X2,x 3 ) = P(xi)P(x2\xi)P(x3\x2). 
Looking for X X X 2 dependence, we have I< 12 ~> = I[X U X 3 ] + I[X 2 ,X 3 ] < I^ 12 ~> = 
I[X\ , X%] + I[X2, X3], where the inequality is due to the information processing inequal- 
ity, and the bound is reached only in special cases. Thus X%, X 2 are (generically) depen- 
dent. Similarly, X 2 , X$ are dependent; but A*' 13 ) = 0, and X\, X% are not (even though 









their marginal mutual information, induced by other interactions, is not zero). Checking 
for the triplet interactions, we find /*( 123 ) = I[X 1 ,X 2 ] + I[X 2 ,X 3 ] = P 123 \ thus no 
such dependencies are present. If now instead X 2 regulates X\ and X 3 , one sees that the 
dependence structure is the same. Both networks correspond to the graph in Fig.^a). 

A more interesting case is when X\, 
X 3 regulate X 2 jointly. Here many 
possibilities exist, not all of them 
realizable in terms of BN modeling. 
First, consider independent regula- 
tion: to predict X 2 , one does not 
need to know the values of X\ and 
X 3 simultaneously, P(x 2 \xi,x 3 ) = 
/12/23, e. g., P(x 2 \xi,x 3 ) oc 
exp [— a(x2 - xi) 2 - b(x 2 - x 3 ) 2 ] 
(this corresponds to OR and AND 
gates 1 32], to the Lac-repressor |43], 
and to all regulatory models based 

on independent binding of transcrip- Figure 1: Examples of dependencies for M = 3. 
tion factors to the DNA O). If 

P13 = P1P3, then the dependency structure is again as in Fig. [Ha), If in addition there 
is a regulation X x -> X 3 , so that P 13 ^ PiP 3 , then A< 13 ' = £>kl[P||<9* (13) ] > 0, and 
A' 123 ' = 0. The dependency graph now has a loop in it, as in Fig.[Hb). 

Further, in the joint regulation case one may consider a nonfactorizable P123 with all 
pairwise marginals factorizable. An example is the XOR gate I32i l43ll (we were un- 
able to construct an explicit, normalized example for continuous variables). In this case, 
Vi,j, I™' = 0. A^ 123 ) > 0, and the dependence structure is as in Fig. ^c). Com- 
binations of two- and three-way dependencies are also possible [Fig. [Qd-f), etc.]; for 
example, an explicit construction for the case (e) is P123 oc exp[— a(x 2 — X3Zi) 2 ]PiP3. 
Such higher order dependencies are uncommon in physics, which usually deals with low 
order interactions among many variables (for example, the Hamiltonian for a spin system 
is H = — J2ij Jij&iCj', thus the JPD of spins has no interactions of order m > 2). In 
contrast, combinatorial regulation in genetics requires considering higher order models. 

While such detailed classification is overwhelming for large M, some general statements 
may be proven for our choice of Q* , Q. In particular, similar to Eq. (|6}, we have 

I[P]> J2 A( subset >. (13) 

all subsets of {Xi} 

Here the inequality is the result of our conservative approach to identification of dependen- 
cies. To prove it, we order all m-way dependencies in an arbitrary way. We then evaluate 
the interaction information for the first dependency with pi" 1 - 1 ) as the reference distri- 
bution, and take the interaction PD for each dependency as the reference PD for the next 
one. Summing all m-way interactions gives I^ m \ and summing over m results in I[P] [cf. 
Eq. (|6j]. On the other hand, according to Thm.^ A' ' ) evaluated this way is not smaller 
than the one with the references Eq. l|S}. This proves the above inequality. 

For M — 3, an interesting illustration of Eq. dl 31 is P123 = Pi 5(x\ — x 2 ) S(x 2 — x 3 ). 
We correctly identify all interactions as reducible (all A's are zero). However, to account 
for the multiinformation in this PD, (any) two pairwise interactions must be invoked. The 
degeneracy is lifted if, for example, noise breaks the symmetry among X's. 

Finally, we note another interesting property of our definition. For continuous Xi, the 
presence of interactions does not depend on (nonsingular) reparameterizations of variables 



A 


£■(23) 


[] 


□ 


£■(13) 


6ft 




V 


£(12) 






O 


£<123) 










Uj 4q 






'fig g 



o 

o 
o 


A E ' m 
a E< ,3 > 
V £«*> 
o E'" 3 ' 


o 

# ° 


8 8 


D 1 


2 



(a) 



(b) 



(0 



Figure 2: Inferring regulatory networks from N samples. We used the NSB 1 36] method to 
estimate the entropies (with error bars) of the JPD and its marginals directly. The method 
does not work for Q**- 123 ^. Thus IPFP was applied to the counts and the entropy 5"*( 123 ) 
of the solution was evaluated and extrapolated for 1/N — > following Strong et al. 1 46] to 
account for the sample size dependent bias. The statistical error for each sample size N was 
determined by bootstrapping, and the resulting extrapolation error was used for 5S*( 123 \ 
This approach works since the MaxEnt constraints, like those in Eq. l|2}, are linear in the 
unknown JPD P, making the biases of S[P] and g*( 123 ) behave similarly. Finally, A' ' ■* 
were calculated as the differences of the appropriate entropies, and <5 2 A^ ' ) as the sums of 
squares of the entropy errors, (a) Network in Fig[Qa). To the left of the vertical dotted 
line, N sw 3000 > 2 s <C K ss 125000, the sample size corrections are reliable, and 

all entropies are known well. There is evidence only for X1X2 and X2X3 interactions, 
just as it should be. For smaller N, the method of Strong et al. fails, but NSB works until 
N ~ 2 1 / 2S [ p l rs 60. Forpairwise interactions, we may replace 5'*( 123 ) by S[P] (denoted 
by smaller markers on the plot) and, since _E^ 13 ' stays zero nonetheless, and I[Xi, X^\ + 
I[X2,X§\ — I[P], we still recover the interactions correctly, (b) Network in Fig.^b). 

Again, to the left of the line, N ss 700 > 2 s , all entropies are determined reliably, 
and there is evidence for all three pairwise interaction, but not for the triplet interaction. 
To the right of the line, NSB still works, but now we cannot disentangle the loop from 
the three-way dependence without estimating 5'*( 123 ). (c) Network in Fig.^f). Only the 

regime N > 2 s rs 5000 is shown. The evidence for all three pairwise interactions 

and for the triplet interaction is barely significant for small N but grows fast. 



that do not mix them, Xi — » yi(xi) (see |2] for a discussion of importance of this): 

A(-)[P(x u ...,x M )} = & ( ' ) [P(yi,...,yM)}. (14) 

This is true since such transformations do not change factorization properties of the Max- 
Ent distributions, and the Jacobians cancel in the definition of A' ' >. Note that, while the 
definition is reparameterization invariant, inference of distributions cannot be done in a 
covariant way 14511 . The invariance vanishes if instead of P only a sample from it is given. 

4 Inferring networks from data 



A big advantage of our definition of statistical dependencies in terms of the MaxEnt ap- 
proximations is that it can be applied even when the underlying PDs are undersampled, 
and the corresponding factorizations cannot be readily observed. For K, the cardinality 
of a variable 8 , larger than the number of samples, N, we cannot estimate the PDs reli- 
ably, but entropic quantities, and, therefore, the interactions are inferable 4 (some progress 

8 In genomics, continuous expression levels are routinely discretized into three states: up, down, 
and baseline. Thus we decided to focus on the discrete case in view of its relevance and concep- 



is possible even for N ~ yK). To show this, we used Dirichlet priors |3Jfl to gener- 
ate random probability distributions with different interaction structures, M — 3, and 
with marginal cardinalities K t w 50. We generated random samples of different sizes, 
N = 50 . . . 125000 from these distributions and tested the quality of inference of the de- 
pendencies as a function of N. To measure it, we used the evidence for some dependency, 
2?( ' ) := A' ' '/6A( ' \ where SA^ ' ' is the statistical error of the interaction multiinforma- 
tion. If 2jK ' ) is large, the dependency is present. According to Fig. |2] proper recovery is 
possible for N <C K — K\K%K$ with few assumptions about the form of the PDs. 



With modern entropy estimation techniques 1 36], our approach will work even for severely 
undersampled JPD. The bottleneck is the estimation of the maximum entropy consistent 
with the marginals, which currently requires substantial sampling of the marginals 1 46]. 
This is encouraging, since they may be well sampled when the JPD is not. However, it is 
still essential to develop techniques to infer maximum entropies directly. Further, the inter- 
action information is the difference of entropies. It may be small when its error, which is 
a quadratic sum of the entropy errors, is large. This leads to uncertainties about dependen- 
cies even for reliably estimated entropies, see the small N region in Fig.|2jc). Therefore, a 
method that directly estimates A will be preferred over another entropy-based technique. 
Finally, as in Fig.^a), variables may have nonzero mutual (or higher order) information 
and no direct interactions. Thus, if X2 was unobserved, we would have inferred a depen- 
dence between X\ and X3. Similarly, spurious higher order interactions may also emerge. 
Our method, just like most other assumption-free methods, may fail for hidden variables. 

For genomic applications, the number of different expression measurements is N < 100, 
and it is not nearly enough to estimate A's and to infer the full interaction network of, 
say, M w 6000 genes in a yeast. However, for ternary discretization of expressions, with 
the Strong et al. entropy estimation, one will not find significant evidence for 1^ beyond 
m* ~ log 3 N ss 4 (or somewhat larger if PDs are far from being uniform). Then one can 
replace P by P^ m ' in Eq. l|8} and study interactions up to the order m* with respect to this 
JPD. It is possible that most interactions in genomes are of such low orders. Additionally, 
if methods like NSB 1 36] are developed for MaxEnt analysis, one should be able to push 
for m* ~ 2 log 3 N rs 8, and this is the primary goal of our future work. 

In summary, we have formalized the concept of multivariate dependence, suggested a way 
to infer dependencies from data, tested the suggestion on undersampled synthetic examples, 
and hinted at possible applications to genomic research. 

Appendix 

Theorem 1 Let {C} be a set of noncontradictory marginal constraints and Q c be the 
MaxEnt distributions satisfying these constraints. Further, let Cq and C\ be additional 
constraints (possibly included in {C}), and Q , Q C1 , and Q be the MaxEnt PDs 
satisfying {C} U Cq, {C} U C\, and {C} U Cq U C\ respectively. Then 

^ C01 \<7, ® C1 
W - \ log2 Q" 



l0 S2 -TyjO ) ^ \ l0 §2 -TyT ) > < 15 > 



where the averaging is performed over Q C01 . 

Intuitively, this says that the interaction multiinformations depend on the order in which 
the interactions are considered. Dependency bits will be accounted for by the first marginal 
able to explain them, attributing less bits to later constraints. At present, this theorem has 
been extensively tested by numerical simulations, but still remains a conjecture. 



tual simplicity. Measuring dependencies for continuous variables follows a similar route, with the 
estimation of entropies performed by one of the many methods reviewed in [38]. 
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